purrr Tutorial
#purrr
In my opinion, purrr is one of the most underrated and under-utilized R packages. It has completely revolutionized my own efficiency and workspace organization, particularly as someone who works with super messy data that comes in a variety of forms.
In this tutorial, we are going to cover a number of what I believe are the most functional and important applications of purrr in psychological research. Given the audience, in the first half of the tutorial, I will focus on working with the diverse forms of data that many of you work with, providing examples of how to load, clean, and merge data using purrr. In the second half, I will focus on how we can use purrr with longitudinal data analysis when we are working with multiple predictors and outcomes.
Before we get there, though, I think it’s useful to think about when and where we would use purrr.
Iteration is everywhere. It underpins much of mathematics and statistics. If you’ve ever seen the \(\Sigma\) symbol, then you’ve seen (and probably used) iteration.
It’s also incredibly useful. Anytime you have to repeat some sort of action many times, iteration is your best friend. In psychology, this often means reading in a bunch of individual data files from an experiment, repeating an analysis with a series of different predictors or outcomes, or creating a series of figures.
library(psych)
library(knitr)
library(kableExtra)
library(lme4)
library(broom.mixed)
library(plyr)
library(tidyverse)

Enter for loops. for loops are the “OG” form of iteration in computer science. The basic syntax is below. Basically, we can use a for loop to loop through and print a series of things.
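A minimal sketch of a loop that produces the output printed below. It assumes the values come from R's built-in `letters` vector, which is my guess at the source rather than something the text states:

```r
# Loop over the first five lowercase letters and print each one.
# `letters` is a constant built into base R; using it here is an assumption.
for (letter in letters[1:5]) {
  print(letter)
}
```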
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
The code above “loops” through 5 times, printing the iteration letter.
Essentially, like the apply(), lapply(), sapply(), and mapply() family of functions, purrr is meant to be an alternative to writing explicit for loops in R. for loops are great, but they aren’t as great in R as they are in other programming languages. In R, you’re usually better off vectorizing or using code with a C++ backend.
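To make "vectorizing" concrete, here is a small sketch that squares a vector once with a for loop and once with a single vectorized expression. The object names are made up for illustration:

```r
x <- 1:5

# Loop version: pre-allocate the result, then fill it one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the arithmetic operator works on the whole vector at once
squares_vec <- x^2
```

Both objects hold the same values; the vectorized form is shorter and leaves the looping to R's internals.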
There are a lot of functions in the purrr package that I encourage you to check out. Today, though, we’ll focus on the map() family of functions. The breakdown of map functions is pretty intuitive. The basic map function wants two things as input – a list or vector and a function. So the purrr equivalent of the example above would be:
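A minimal sketch of a call that produces the output below, again assuming the built-in `letters` vector as input. The name `res` is only there so the result can be inspected:

```r
library(purrr)

# Apply print() to each of the first five letters.
# print() returns its argument invisibly, so map() collects the letters in a list.
res <- map(letters[1:5], print)

# Display the list that map() returned
res
```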
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "b"
##
## [[3]]
## [1] "c"
##
## [[4]]
## [1] "d"
##
## [[5]]
## [1] "e"
Note that this returns a list, which we may not always want. With purrr, we can change the type of output map() returns by adding a suffix, such as _lgl, _dbl, _chr, or _df. So in the example above, we may have wanted a character vector rather than a list. To get that, we’d call map_chr():
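A sketch of the corresponding call, under the same assumption that the input is the built-in `letters` vector:

```r
library(purrr)

# Same iteration as before, but map_chr() simplifies the result
# to a character vector instead of a list.
res_chr <- map_chr(letters[1:5], print)

# Display the character vector that map_chr() returned
res_chr
```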
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [1] "a" "b" "c" "d" "e"
Note that it still prints each letter individually (i.e. iteratively), and then returns the letters combined into a single character vector.
map() functions can also handle multiple inputs. Often we need to pass multiple pieces of information to a function. In this case, we have map2() and pmap(), which take additional arguments. map2(), shockingly, takes two inputs, and pmap() takes p inputs that you feed in as a list (e.g. pmap(list(a, b, c, d), my_fun)). Unlike nested for loops, which visit every combination of inputs, both functions step through their inputs in parallel, one element from each at a time. A simple printing example would be:
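A sketch of calls that behave this way, assuming the inputs are the built-in `letters` vector and the integers 1 to 5. The `pmap_chr()` call with a third input (`LETTERS`) is an extra illustration of mine, not part of the original example:

```r
library(purrr)

# map2_chr() walks two inputs in parallel: paste("a", 1), paste("b", 2), ...
pairs <- map2_chr(letters[1:5], 1:5, paste)
pairs

# pmap_chr() does the same for any number of inputs, supplied as one list
triples <- pmap_chr(list(letters[1:3], 1:3, LETTERS[1:3]), paste)
triples
```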
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"
Note here that we can use map2() and pmap() with the suffixes from above (e.g. map2_chr()).
This likely makes little sense at this point, and that’s fine. The examples in the rest of this tutorial should elucidate their usage. The last note I’ll make is that thinking about the structure of your data is going to be very important when using purrr. To use it effectively, you’ll need your data in specific forms, which will often require data manipulations. It just takes practice.
Regardless of the programmatic form, iteration is useful any time you have to repeat an action many times. In psychology, this could mean reading in a bunch of separate data files (with separate files for different people, variables, waves, etc.) or performing a number of regressions or other statistical tests.
To demonstrate the first case in which I find purrr useful, we are going to consider five cases that, in my experience, capture many of the challenges we often face in working with psychological data. In each of these cases, we will use a codebook of the form we discussed in the previous tutorial on codebooks.
All of these share a similar feature: multiple files. There are a variety of other techniques you could use to get your data into a usable form.
But let’s not do that. Let’s use iteration to make our process efficient and transparent.
We will start with a data storage format that is very common in experimental studies in various fields of psychology as well as in observational studies of repeated assessments of individuals (i.e. ESM, EMA, etc.).
For this first example, I’ll show you how this would look with a for loop before I show you how it looks with purrr.
Assuming you have all the data in a single folder and the format is reasonably similar, you have the following basic syntax:
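data_path <- ""  # path to the folder that holds the data files
# full.names = TRUE returns complete paths, so read.csv() can find each file
files <- list.files(data_path, full.names = TRUE)
data <- list()
for (i in files) {
  data[[i]] <- read.csv(i, stringsAsFactors = FALSE)
}
# stack the per-file data frames into a single data frame
data <- bind_rows(data)

This works fine in this simple case, but where purrr really shines is when you need to make modifications to your data before combining, whether this be recoding, removing missing cases, or renaming variables.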
But first, the simple case of reading data.
data_path <- "~/Documents/week_3_purrr"
df1 <- tibble(ID = list.files(sprintf("%s/data/example_1", data_path))) %>%
  mutate(path = sprintf("%s/data/example_1/%s", data_path, ID),
         data = map(path, read_csv),
         # escape the dot and anchor to the end so only the extension is removed
         ID = str_remove(ID, "\\.csv$")) %>%
  unnest(data) %>%
  select(-path)

The code above creates a column of IDs from the file names in the data path (files are named for each person), reads each file in using the map() function from purrr, removes the “.csv” from the ID variable, and then unnests the data, resulting in a single data frame in which the ID column marks each person’s rows.
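To try this pattern without the tutorial's files, here is a self-contained sketch. It writes two tiny per-person files to a temporary folder and reads them back the same way. The folder name, file names, and the `day` and `mood` columns are all invented for illustration:

```r
library(tidyverse)

# Hypothetical setup: two small per-person files in a temporary folder
tmp_dir <- file.path(tempdir(), "example_files")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(tibble(day = 1:2, mood = c(3, 4)), file.path(tmp_dir, "001.csv"))
write_csv(tibble(day = 1:2, mood = c(5, 2)), file.path(tmp_dir, "002.csv"))

# One row per file, a list column of data frames, then unnest to long form
toy <- tibble(ID = list.files(tmp_dir, pattern = "\\.csv$")) %>%
  mutate(path = file.path(tmp_dir, ID),
         # col_types = cols() lets readr guess column types without a message
         data = map(path, read_csv, col_types = cols()),
         ID = str_remove(ID, "\\.csv$")) %>%
  unnest(data) %>%
  select(-path)

toy
```

Before `unnest()`, `toy` would have one row per file with a data frame stored in each cell of `data`; afterwards it has one row per observation, labeled by `ID`.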
But often, we have variable names that aren’t super informative, so we want to rename them. In this case, we need to use our codebook to give them more informative variable names.
In this case, where all people have the same variables, it’s easiest to just rename them after unnesting, so the full code would look like this:
data_path <- "https://github.com/emoriebeck/R-tutorials/raw/master"
# list.files() cannot list the contents of a URL, so the participant files are
# assumed to sit in a local copy of the tutorial folder
local_path <- "~/Documents/week_3_purrr"
(codebook <- sprintf("%s/ALDA/week_3_purrr/data/codebook_ex1.csv", data_path) %>% read_csv)

old.names <- codebook$old_name
new.names <- codebook$new_name
df1 <- tibble(ID = list.files(sprintf("%s/data/example_1", local_path))) %>%
  mutate(path = sprintf("%s/data/example_1/%s", local_path, ID),
         data = map(path, read_csv),
         ID = str_remove(ID, "\\.csv$")) %>%
  unnest(data) %>%
  # all_of() selects the columns named in the external character vector
  select(ID, all_of(old.names)) %>%
  setNames(c("ID", new.names))

In some cases, participants may have different variables. This could be due to a skip rule in a study or intentionally different variable collection (e.g. in between-person experiments or idiographic work like I do). In this case, we might need to filter or rename variables within our iterative loop.
In this case, all participants have the same set of core variables but were randomly assigned to complete one additional scale.
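As a rough sketch of renaming inside the iteration, the example below uses small in-memory data frames in place of files. The codebook, variable names, and the helper `rename_fun()` are all hypothetical stand-ins, not the tutorial's actual data or code:

```r
library(tidyverse)

# Hypothetical codebook and participant data (names are illustrative only)
toy_codebook <- tibble(
  old_name = c("q1", "q2", "s1", "s2"),
  new_name = c("extraversion", "agreeableness", "scale_A", "scale_B")
)
toy_data <- list(
  "001" = tibble(q1 = 3, q2 = 4, s1 = 2),  # completed scale A
  "002" = tibble(q1 = 5, q2 = 1, s2 = 4)   # completed scale B
)

# Rename whichever codebook variables a given participant actually has
rename_fun <- function(d) {
  present <- toy_codebook %>% filter(old_name %in% names(d))
  d %>%
    select(all_of(present$old_name)) %>%
    setNames(present$new_name)
}

# Rename each participant's data, then stack; missing scales are filled with NA
toy_renamed <- map(toy_data, rename_fun) %>%
  bind_rows(.id = "ID")

toy_renamed
```

Because the renaming happens per participant, columns line up by their new names when the data frames are combined, and a scale a participant never saw simply shows up as missing.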
In some cases, instead of multiple files for each participant, we collect a single file for all participants across different waves (e.g. using Qualtrics). In this case, we need to index the files a little differently. Instead of reading in files for participants, we need to read in files for waves, which may be named in a variety of ways.
Here, I’ll start with a simple example of data that were well-managed and nicely named the same except for wave content. This is a good practice to do. I’m in general against modifying data, but I am a fan of changing file names because I think this actually helps with data management and prevents the need to actually go in and modify information within files.
These data come from a longitudinal study of personality. We have seven waves, and the variable names for all items are consistent across waves. In this case, our code is almost identical to reading in multiple files for each participant, except that now we have wave info and will need to toss out part of the file names at the end.
codebook <- sprintf("%s/ALDA/week_3_purrr/data/codebook_ex3.csv", data_path) %>% read_csv
# strip stray spaces from the old variable names
old.names <- str_remove_all(codebook$old_name, "[ ]")
new.names <- codebook$new_name
df3 <- tibble(wave = paste("T", 1:7, sep = ""),
              path = sprintf("%s/ALDA/week_3_purrr/data/example_3/%s.csv", data_path, wave)) %>%
  mutate(data = map(path, read_csv),
         # keep only the digits of the wave label, e.g. "T1" becomes 1
         wave = as.numeric(str_extract_all(wave, "[0-9]"))) %>%
  select(-path) %>%
  unnest(data) %>%
  select(all_of(old.names)) %>%
  setNames(new.names)

The only change from the code for reading in multiple files for participants is that we have “wave” as a variable instead of “ID” and we use the str_extract_all() function from the stringr package (part of the tidyverse) to get rid of everything except the numeric wave value.
Oftentimes, however, we do not have the same variables across waves, or the variables do not have the same names across waves. In those cases, we’ll have to do a little extra work to get our data into a form where we can unnest() them, that is, a form in which shared column names really are shared.
We’ll start with the case where we have some additional information (e.g. demographics) in the first wave.
These data are the same as we used in the previous example except that I changed the names and added demographic information for this example. This means that we have slightly different information in wave one and need a way to match the same variables across waves. We’ll use our codebook to achieve this with little issue!
However, because of this, we’ll need to use a function that takes the wave as input, so that we pull the correct variables from the codebook.
codebook <- sprintf("%s/ALDA/week_3_purrr/data/codebook_ex4.csv", data_path) %>% read_csv

read_fun <- function(Wave){
  # keep codebook rows that apply to every wave or to this wave only
  cb <- codebook %>% filter(wave == "All" | wave == Wave)
  old.names <- str_remove_all(cb$old_name, "[ ]")
  new.names <- cb$new_name
  sprintf("%s/ALDA/week_3_purrr/data/example_4/T%s.csv", data_path, Wave) %>%
    read_csv() %>%
    select(all_of(old.names)) %>%
    setNames(new.names) %>%
    gather(key = item, value = value, -SID)
}

df4 <- tibble(wave = 1:7) %>%
  mutate(data = map(wave, read_fun)) %>%
  unnest(data) %>%
  # attach the wave to each item name, then spread to one row per person
  unite(tmp, item, wave, sep = ".") %>%
  spread(tmp, value) %>%
  # return to long format, leaving the demographic columns wide
  gather(key = item, value = value, -SID, -contains("Dem")) %>%
  separate(item, c("item", "wave"), sep = "[.]") %>%
  spread(item, value)

In other cases, we may have multiple types of files for different waves. Across waves, those variables may be the same or different, but we’ll focus on the case when we largely want the same variables.
Another really powerful feature of purrr is keeping your data, models, tables, plots, etc. all conveniently indexed together. Often we need to do this for multiple DVs or predictors, and you may end up with an environment that looks something like E_fit1, A_fit1, E_fit2, A_fit2 and so on. There’s nothing wrong with this. But eventually you’ll want to pull out coefficients, plot results, etc., and it’s easy to make a copy and paste error or name different types of objects inconsistently, which can be difficult both for future you and for someone else using your code.
Before we can learn how to use purrr for this, we need to understand what a nested data frame is. If you’ve ever worked with a list in R, you are halfway there. Basically a nested data frame takes the normal data frame you are probably familiar with and adds some new features. It still has columns, rows, and cells, but what makes up those cells isn’t restricted to numbers, strings, or logicals. Instead, you can put essentially anything you want: lists, models, data frames, plots, etc!
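A small sketch of what a nested data frame can look like, using made-up data and plain lm() as a simple stand-in for the mixed models used elsewhere in this tutorial. All object and column names here are hypothetical:

```r
library(tidyverse)

# Hypothetical long-format data: two outcomes, each measured at four waves
toy_long <- tibble(
  outcome = rep(c("E", "A"), each = 4),
  wave    = rep(1:4, times = 2),
  value   = c(1, 2, 3, 4, 4, 3, 2, 1)
)

toy_nested <- toy_long %>%
  group_by(outcome) %>%
  nest() %>%      # one row per outcome; the `data` cell holds a data frame
  ungroup() %>%
  mutate(
    # a fitted model stored in each cell of the `model` column
    model = map(data, ~ lm(value ~ wave, data = .x)),
    # an ordinary numeric column pulled out of each model
    slope = map_dbl(model, ~ coef(.x)[["wave"]])
  )

toy_nested
```

Each row keeps an outcome's data, model, and slope side by side, so there is no need for separately named objects like E_fit1 and A_fit1.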
If this sounds scary, it will hopefully become clearer if we use our read in data from above to run, table, and plot some basic longitudinal models of our data.
codebook <- sprintf("%s/ALDA/week_3_purrr/data/codebook_ex6.csv", data_path) %>%
  read_csv %>%
  mutate(old_name = str_to_lower(old_name))

read_fun <- function(Year){
  # keep codebook rows for this year, plus rows coded 0
  # (which appear to mark variables that apply to every year)
  cb <- codebook %>% filter(year == Year | year == 0)
  old.names <- cb$old_name
  new.names <- cb$new_name
  # name of the data set (file) listed for this year
  set <- (codebook %>% filter(year == Year))$dataset[1]
  sprintf("%s/ALDA/week_3_purrr/data/example_6/%s.csv", data_path, set) %>%
    read_csv %>%
    select(all_of(old.names)) %>%
    setNames(new.names)
}

(df6 <- tibble(year = 2005:2015) %>%
  mutate(data = map(year, read_fun)) %>%
  select(-year) %>%
  unnest(data))